Abstract — It is proved that maximizing the mutual information 
over all source distributions with a certain average power or 
over the larger set of source distributions with upperbounded 
average power yields the same channel capacity in both cases. 
Hence, the channel capacity cannot decrease with increasing 
average transmitted power. These fundamental properties hold 
for any continuous-input, continuous-output single-user channel, 
including channels with severe nonlinear distortion. 

Index Terms — Average power constraint, channel capacity, 
constrained capacity, mutual information, nonlinear capacity, 
nonlinear distortion, optical communications, Shannon capacity. 



I. Introduction 

IN THE MOST cited paper in the history of information 
theory [1], Shannon proved that with adequate coding, 
reliable communication is possible over a noisy channel, as 
long as the rate does not exceed a certain threshold, called 
the channel capacity. He provided in 1948 a mathematical ex- 
pression for the channel capacity of any channel, based on its 
statistical properties. The expression is given as the supremum 
over all possible source distributions of a quantity later called 
the mutual information 0, 0. The channel capacity is often 
studied as a function of the average transmitted power. This 
function is obtained by optimizing the mutual information over 
all source distributions whose average second moment is either 
equal to the given power or upperbounded by this power — 
the convention differs between disciplines. We will return to 
the distinction between the two definitions at the end of this 
section. 

For linear channels with additive, signal-independent noise, 
the channel capacity is an increasing function of the trans- 
mitted power. The most well-known example is the additive 
white Gaussian noise (AWGN) channel, for which the channel 
capacity is known exactly [1, Sec. 24], [4, Ch. 9]. In recent 
years, the problem of calculating or estimating the channel 
capacity of more complicated channels has received a lot 
of attention (see surveys in 0-Q). Due to the absence of 
exact analytical solutions and the computational intractability 
of optimizing over all possible source distributions, most 
investigations of the channel capacity of non-AWGN channels 
rely on bounding techniques and asymptotic analysis. 

If only noncoherent detection is available at the receiver, the 
channel capacity can be analyzed by including a magnitude 
operation at the output of a discrete-time complex AWGN 
channel. The channel capacity is in this case not known 
exactly, but it increases logarithmically with transmitted power 
as approximately half the regular AWGN channel capacity [8], 
[9, Sec. 11.2]. The same behavior has been shown for the 
phase-noise channel, in which the transmitted signal is subject 
to a uniformly random phase shift before the Gaussian noise 
is added 0, [10]; indeed, according to [5], these two channel 
models are equivalent in terms of channel capacity. 

For the Rayleigh-fading channel, the channel capacity in- 
creases logarithmically with power, with just an asymptotic 
offset to the AWGN channel capacity, if the receiver has full 
channel state information [11], fPHl. The increase is doubly 
logarithmic if no channel state information is available [12]-[14]. 
The results have been extended to other wireless channel 
models, including Rician fading, systems with transmitter-side 
channel state information, and multiple-antenna channels, see 
El, El, CH Sec. 4.2-4.3, 10.3, 14.5-14.7] and references 
therein. In all these cases, the channel capacity is an increasing 
function of the transmitted power. 

Of particular interest for this paper is the type of non- 
linear distortion encountered in fiber-optical communications 
ifTTl . ifTHl Sec. 7.2]. The impact of this nonlinear distortion 
increases dramatically with the transmitted power, to the 
extent that communication becomes virtually impossible if the 
instantaneous power is high enough [19], [20, Ch. 9]. This 
phenomenon is well known from experiments and simulations. 
Thus one might expect that the mutual information and chan- 
nel capacity would approach zero at sufficiently high power. 

If the mutual information is computed for a given source 
distribution, or optimized over a subset of all possible source 
distributions, a lower bound on the channel capacity is obtained. 
Numerous such lower bounds have been derived for nonlinear 
fiber-optical channels. The earliest bounds for optical channel 
capacity assumed a Gaussian source density, for which 
the mutual information can be calculated or lowerbounded 
analytically 0, [21]-[26]. If the source is constrained to 
a ring with constant amplitude, another bound is obtained, 
which is stronger under some conditions [27], [9, Sec. 11.4]. 
In recent studies, the mutual information has been optimized 
numerically for concentric multiring constellations 0, [28]-[33], 
which yields other bounds on the channel capacity. Interestingly, 
all these lower bounds show the same general trend: 
As the average power (or signal-to-noise ratio) increases, they 
increase towards a peak, and then they decrease again towards 
zero as the power is further increased, similarly to most of the 
curves in Figs. 2-4. This is not unexpected since, as mentioned 
above, the severity of the nonlinear distortion increases with 
power. In contrast, [34] and [35] indicate that the channel 
capacity may increase monotonically for certain nonlinear 
optical channels. 

Some of the lower bounds mentioned above are derived for 
single-user systems and others for multiuser systems. In optical 
communications, a multiuser system is sometimes modeled as 
a channel where the distortion, representing interference from 
other users, changes depending on the transmitted power — i.e., 
a source-dependent channel. Such channels will not be con- 
sidered in this paper. We focus on single-user systems, where 
the channel can be represented by a single, fixed probability 
density function (pdf) of the channel output conditioned on the 
input, which is the scenario considered in Shannon's original 
work [1]. 

Apart from lower bounds, not much is known about the 
channel capacity of nonlinear fiber-optical channels. To the 
author's knowledge, no generally accepted expressions are 
available for the exact channel capacity,¹ and no upper bounds 
are known, apart from the standard channel capacity for the 
linear AWGN channel, which neglects all nonlinear distortion. 
Unfortunately, the numerous articles about lower bounds have 
often been cited in terms of just channel capacity (or capacity, 
spectral efficiency, information spectral density, etc.), without 
mentioning that the cited results are bounds. Therefore, there is 
a widespread belief in the optical community that the channel 
capacity of channels with strong nonlinear distortion increases 
with power to a certain maximum value and then decreases 
again towards zero. 

In this paper, we prove rigorously that the channel capac- 
ity is an increasing (but not necessarily strictly increasing) 
function of the average power, thus disproving the standard 
belief of a peaky behavior in the single-user case. There is no 
contradiction between this result and the numerous decreasing 
lower bounds referenced above, but we do recommend some 
caution in drawing conclusions about the true channel capacity 
from the behavior of its lower bounds alone. It is beyond 
doubt that the mutual information can decrease with power, 
as do many lower bounds on the channel capacity, but not 
channel capacity in the Shannon sense. This fundamental 
theorem is proved for a general continuous-input, continuous- 
output channel, not confined to any particular channel model 
or application. 

The theorem holds regardless of whether the given power 
level is interpreted as the exact second moment of the source or 
an upper bound thereof. The proof is developed assuming the 
former definition, and it is trivial for the latter. An interesting 
consequence of the increasing channel capacity is that the two 
definitions of channel capacity are fully equivalent. 

II. Mutual Information, Constrained Capacity, 
and Channel Capacity 

For any random vectors $Y$ and $X$ of length $n$, the (differential) 
entropy $h(Y)$ and conditional entropy $h(Y|X)$ are 
defined as [1, Sec. 20] 

$$h(Y) \triangleq -\int_{\mathbb{R}^n} f_Y(y) \log f_Y(y)\, dy, \tag{1}$$

$$h(Y|X) \triangleq -\iint f_{X,Y}(x,y) \log f_{Y|X}(y|x)\, dx\, dy, \tag{2}$$

where $f_Y(y)$ denotes the pdf of $Y$, $f_{Y|X}(y|x)$ is the conditional 
pdf of $Y$ given $X$, and $f_{X,Y}(x,y)$ is the joint pdf 

¹ Exact expressions were proposed in [36] and in [37], [38], but their validity 
was questioned in [6] and [34], resp. 



of $X$ and $Y$. All logarithms are to base 2 and $0 \log 0$ should 
be interpreted as 0.² The mutual information in bits/symbol 
between $X$ and $Y$ is defined as El, [4, Sec. 8.5] 

$$I(X;Y) \triangleq h(Y) - h(Y|X). \tag{3}$$

Let $X$ and $Y$ represent the input and output, resp., of 
a communication channel. The joint pdf $f_{X,Y}(x,y)$ can be 
factorized as $f_{X,Y}(x,y) = f_X(x) f_{Y|X}(y|x)$, where $f_X$ 
represents the source and $f_{Y|X}$ represents the channel. The 
source pdf $f_X$ is usually chosen to match a certain channel 
$f_{Y|X}$, but the converse is not realistic for single-user channels: 
the channel should be represented by the same function 
$f_{Y|X}(y|x)$ regardless of the source. For any given source pdf 
$f_X(x)$, the mutual information can be calculated from (1)-(3) 
using $f_Y(y) = \int f_{X,Y}(x,y)\, dx$. 

The supremum of these mutual informations, over all possible 
source pdfs $f_X$, is the channel capacity [1, Sec. 23], 
[39], E] p. 274]. In this paper, we study the channel capacity 
$C$ as a function of the average transmitted power $P$, defined 
as the second moment of the source distribution. This is a 
special case of a capacity-cost function, where cost is the 
average transmitted power. The function can be defined in two 
subtly different ways, depending on whether the power (cost) 
is upperbounded by $P$ or exactly $P$. In the first case, which 
is most common in classical information theory [40, Ch. 7], 
[41], [4, Ch. 9], the channel capacity is defined as 

$$C'(P) \triangleq \sup_{f_X \in \Omega'(P)} I(X;Y), \tag{4}$$

where $\Omega'(P)$ is the set of all pdfs³ over $\mathbb{R}^n$ such that 

$$\int_{\mathbb{R}^n} \|x\|^2 f_X(x)\, dx \le P. \tag{5}$$

In the second case, which is prevalent in optical information 
theory [21], [23], the channel capacity is 

$$C(P) \triangleq \sup_{f_X \in \Omega(P)} I(X;Y), \tag{6}$$

where $\Omega(P)$ is the set of all pdfs over $\mathbb{R}^n$ such that 

$$\int_{\mathbb{R}^n} \|x\|^2 f_X(x)\, dx = P.$$

In both cases, $f_{Y|X}(y|x)$ is assumed to be independent of 
$P$; i.e., the channel statistics do not depend on the source 
statistics. 

Since $\Omega'(P) \supseteq \Omega(P)$, $C'(P) \ge C(P)$ for all $P$ and all 
channels. Furthermore, since $\Omega'(P_2) \subseteq \Omega'(P_1)$ for $P_1 \ge P_2$, 
$C'(P)$ is a nondecreasing function of $P$ [42, p. 3-21]. In 
this paper, "channel capacity" refers to $C(P)$ unless otherwise 
stated. However, the two definitions are in fact equivalent, as 
will be shown in Theorem 2. 

If the optimization of $I(X;Y)$ is instead done over a subset 
of $\Omega(P)$ (or $\Omega'(P)$), a constrained capacity is obtained. Many 
versions of constrained capacity have been studied in the past, 
such as confining $X$ to a certain range or to a certain discrete 
constellation. 

² This convention can be made rigorous by confining the integrals to the 
support of the involved random variables. 

³ The distribution may be discrete, continuous, or mixed. 
To summarize the terminology used in this paper, we will 
use "mutual information" when no optimization is carried out, 
"constrained capacity" when the optimization is over some, but 
not all, possible source distributions, and "channel capacity" 
when the optimization is over all possible source distributions. 
Thus, the mutual information between input and output is 
a property of the channel and the source, the constrained 
capacity is a property of the channel and the source constraints, 
and the channel capacity is a property of the channel alone. To 
avoid confusion, we will not use just "capacity" in this paper. 

III. The Law of Monotonic Channel Capacity 

We are now ready to state the main result. It is given 
in terms of an arbitrary discrete-time, memoryless, vectorial 
channel. This general channel model includes the discrete-time 
channel with memory, if the dimension (block length) is 
chosen large enough [39], and also the continuous-time 
bandlimited channel, because bandlimited waveforms can without 
loss be represented by vectors of their samples [1, Sec. 23]. 

Theorem 1 (Law of monotonic channel capacity): $C(P)$ is 
a nondecreasing function of $P$ for any channel $f_{Y|X}(y|x)$. 

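Proof: Let, for a given $P_1 > 0$, $f_{X_1} \in \Omega(P_1)$ be a pdf 
that attains the channel capacity and $Y_1$ be the channel output 
when the input is $X_1$. The distribution of $Y_1$ is determined 
by $f_{Y_1|X_1}(y|x) = f_{Y|X}(y|x)$. Thus 

$$I(X_1;Y_1) = C(P_1).$$

Let $g$ be any pdf over $\mathbb{R}^n$ with nonzero power 

$$P_g = \int_{\mathbb{R}^n} \|x\|^2 g(x)\, dx > 0. \tag{7}$$

For any $P_2 > P_1$ and $0 < \epsilon < 1$, let 

$$f_{X_2}(x) = (1-\epsilon) f_{X_1}(x) + \epsilon \beta^n g(\beta x), \tag{8}$$

where 

$$\beta = \sqrt{\frac{\epsilon P_g}{P_2 - (1-\epsilon) P_1}}$$

is a real constant. With this construction, $f_{X_2} \in \Omega(P_2)$ 
because 

$$\begin{aligned}
\int_{\mathbb{R}^n} f_{X_2}(x)\, dx &= (1-\epsilon) \int_{\mathbb{R}^n} f_{X_1}(x)\, dx + \epsilon \int_{\mathbb{R}^n} \beta^n g(\beta x)\, dx \\
&= 1 - \epsilon + \epsilon \\
&= 1
\end{aligned}$$

and 

$$\begin{aligned}
\int_{\mathbb{R}^n} \|x\|^2 f_{X_2}(x)\, dx &= (1-\epsilon) \int_{\mathbb{R}^n} \|x\|^2 f_{X_1}(x)\, dx + \frac{\epsilon}{\beta^2} \int_{\mathbb{R}^n} \beta^{n+2} \|x\|^2 g(\beta x)\, dx \\
&= (1-\epsilon) P_1 + \frac{\epsilon}{\beta^2} P_g \\
&= (1-\epsilon) P_1 + \frac{P_2 - (1-\epsilon) P_1}{P_g} P_g \\
&= P_2.
\end{aligned}$$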

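Let $Y_2$ denote the output of the studied channel when the 
input is $X_2$, i.e., $f_{Y_2|X_2}(y|x) = f_{Y|X}(y|x)$. The entropy 
(1) of $Y_2$ can be calculated as 

$$\begin{aligned}
h(Y_2) &= -\int f_{Y_2}(y) \log f_{Y_2}(y)\, dy \\
&= -\iint f_{X_2}(x) f_{Y|X}(y|x) \log f_{Y_2}(y)\, dx\, dy \\
&= -(1-\epsilon) \iint f_{X_1}(x) f_{Y|X}(y|x) \log f_{Y_2}(y)\, dx\, dy \\
&\quad - \epsilon \iint \beta^n g(\beta x) f_{Y|X}(y|x) \log f_{Y_2}(y)\, dx\, dy \\
&= -(1-\epsilon) \int f_{Y_1}(y) \log f_{Y_2}(y)\, dy \\
&\quad - \epsilon \int \beta^n g(\beta x) \int f_{Y|X}(y|x) \log f_{Y_2}(y)\, dy\, dx.
\end{aligned} \tag{9}$$

This entropy can be lowerbounded using the information 
inequality [1, Sec. 20], [4, p. 252], according to which 

$$\int p(x) \log q(x)\, dx \le \int p(x) \log p(x)\, dx$$

for any pdfs $p(x)$ and $q(x)$. Applying this inequality twice in 
(9) yields 

$$\begin{aligned}
h(Y_2) &\ge -(1-\epsilon) \int f_{Y_1}(y) \log f_{Y_1}(y)\, dy \\
&\quad - \epsilon \int \beta^n g(\beta x) \int f_{Y|X}(y|x) \log f_{Y|X}(y|x)\, dy\, dx \\
&= (1-\epsilon) h(Y_1) - \epsilon \iint \beta^n g(\beta x) f_{Y|X}(y|x) \log f_{Y|X}(y|x)\, dx\, dy.
\end{aligned} \tag{10}$$

Furthermore, the conditional entropy (2) of $Y_2$ given $X_2$ is 

$$\begin{aligned}
h(Y_2|X_2) &= -\iint f_{X_2}(x) f_{Y|X}(y|x) \log f_{Y|X}(y|x)\, dx\, dy \\
&= -(1-\epsilon) \iint f_{X_1}(x) f_{Y|X}(y|x) \log f_{Y|X}(y|x)\, dx\, dy \\
&\quad - \epsilon \iint \beta^n g(\beta x) f_{Y|X}(y|x) \log f_{Y|X}(y|x)\, dx\, dy \\
&= (1-\epsilon) h(Y_1|X_1) - \epsilon \iint \beta^n g(\beta x) f_{Y|X}(y|x) \log f_{Y|X}(y|x)\, dx\, dy.
\end{aligned} \tag{11}$$

Combining (10) and (11) in (3) yields 

$$\begin{aligned}
I(X_2;Y_2) &= h(Y_2) - h(Y_2|X_2) \\
&\ge (1-\epsilon) h(Y_1) - (1-\epsilon) h(Y_1|X_1) \\
&= (1-\epsilon) I(X_1;Y_1).
\end{aligned} \tag{12}$$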




Preprint, September 19, 2011 



The channel capacity can now be bounded as 

$$\begin{aligned}
C(P_2) &= \sup_{f_X \in \Omega(P_2)} I(X;Y) \\
&\ge \sup_{0 < \epsilon < 1} I(X_2;Y_2) \\
&\ge \sup_{0 < \epsilon < 1} (1-\epsilon) I(X_1;Y_1) \\
&= I(X_1;Y_1) \\
&= C(P_1),
\end{aligned} \tag{13}$$

where the first inequality follows from $f_{X_2} \in \Omega(P_2)$ and the 
second from (12). To summarize, if $P_2 > P_1$, then $C(P_2) \ge 
C(P_1)$, which completes the proof. □ 
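
The power bookkeeping behind the two-part source (8) can be checked numerically. The following is a minimal sketch, not the author's code: it assumes a scalar channel input ($n = 1$) and replaces the pdfs by discrete sources, so that drawing from $\beta g(\beta x)$ amounts to drawing $G/\beta$ with $G$ distributed according to $g$. The function names `second_moment` and `two_part_source` and the example numbers are hypothetical.

```python
import numpy as np

def second_moment(points, weights):
    """Average power sum_i w_i * x_i^2 of a discrete scalar source."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * points**2))

def two_part_source(x1, w1, xg, wg, p2, eps):
    """Discrete, scalar stand-in for (8): mix source 1 with a rescaled copy of g.

    With probability 1 - eps a sample is drawn from source 1 (power P1);
    with probability eps a sample G / beta is drawn, where G follows g.
    beta is chosen so that the mixture has average power exactly p2.
    """
    p1 = second_moment(x1, w1)
    pg = second_moment(xg, wg)
    if not (0.0 < eps < 1.0) or p2 < p1 or pg <= 0.0:
        raise ValueError("need 0 < eps < 1, p2 >= P1 and Pg > 0")
    beta = np.sqrt(eps * pg / (p2 - (1.0 - eps) * p1))
    points = np.concatenate([np.asarray(x1, dtype=float),
                             np.asarray(xg, dtype=float) / beta])
    weights = np.concatenate([(1.0 - eps) * np.asarray(w1, dtype=float),
                              eps * np.asarray(wg, dtype=float)])
    return points, weights, beta

# Hypothetical numbers: source 1 is BPSK with P1 = 4, g is unit-power BPSK,
# and the target power is P2 = 100 with eps = 0.01.
points, weights, beta = two_part_source([-2.0, 2.0], [0.5, 0.5],
                                        [-1.0, 1.0], [0.5, 0.5],
                                        p2=100.0, eps=0.01)
print(weights.sum(), second_moment(points, weights), 1.0 / beta)
```

In this example, 99% of the probability stays on the original low-power points, while the remaining 1% sits on two far-away points that carry almost all of the added power.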
A practical interpretation of the theorem is that it is possible 
to waste power without sacrificing channel capacity. Even 
though this is a huge improvement over previous results, 
where the channel capacity was believed to decay to zero, a 
system designer would not be too excited over the possibility 
to waste power without gaining anything. However, (13) is 
only a lower bound, obtained when the source distribution has 
the special form (8). This form was chosen because of its 
general applicability to any channel $f_{Y|X}(y|x)$, but it is also 
not optimized for any channel. If (8) were replaced by 
a distribution optimized for a given channel and transmitted 
power, $I(X_2;Y_2)$ may increase, and a tighter lower bound 
than (13) would be obtained. This would make the channel 
capacity strictly increasing with power, which in addition to its 
theoretical significance may have practical implications for the 
design of communication systems operating in the nonlinear 
regime. 

An immediate consequence of Theorem 1, given by the next 
theorem, is that the power-limited channel capacity $C'(P)$ is 
achieved by a source distribution $f_X$ for which the second 
moment equals the maximum allowed value $P$. This means 
that the two definitions (4) and (6) are equivalent. 

Theorem 2: For any channel and any $P$, $C'(P) = C(P)$. 
Proof: The definition (4) can be written as 

$$C'(P) = \sup_{P_1 \le P} C(P_1),$$

which by Theorem 1 is equal to $C(P)$. 
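
The two steps of this short proof can be spelled out as follows. This is a sketch that uses only the definitions (4)-(6) and Theorem 1; writing the power-limited set as a union over exact power levels $0 \le P_1 \le P$ is an assumption about how the sets are indexed, consistent with (5). The level $P_1 = 0$ contributes only the degenerate source $X = 0$, whose mutual information is zero.

```latex
\begin{align*}
\Omega'(P) &= \bigcup_{0 \le P_1 \le P} \Omega(P_1), \\
C'(P) &= \sup_{f_X \in \Omega'(P)} I(X;Y)
       = \sup_{0 \le P_1 \le P} \; \sup_{f_X \in \Omega(P_1)} I(X;Y)
       = \sup_{0 \le P_1 \le P} C(P_1), \\
C(P_1) &\le C(P) \ \text{for all } P_1 \le P \ \text{(Theorem 1)}
       \quad\Longrightarrow\quad
       \sup_{0 \le P_1 \le P} C(P_1) = C(P).
\end{align*}
```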



IV. Numerical Examples 



In this section, examples will be presented for mutual 
information, constrained capacity, and channel capacity as 
functions of the average transmitted power, where the mutual 
information and constrained capacity have peaks but the 
channel capacity, as predicted by Theorem 1, is nondecreasing. 

A. A Nonlinear Channel 

We consider a very simple channel with nonlinear distortion 
and noise, represented as 

$$Y = a(X) + Z, \tag{14}$$

where $X$ and $Y$ are the real, scalar input and output of the 
channel, resp., $a(\cdot)$ is a deterministic, scalar function and $Z$ 
is white Gaussian noise with zero mean and variance $\sigma_Z^2$. For 
a given channel input $x$, the conditional pdf of the channel 
output is 

$$f_{Y|X}(y|x) = \frac{1}{\sigma_Z} f_G\!\left(\frac{y - a(x)}{\sigma_Z}\right), \tag{15}$$

where $f_G(x) = (1/\sqrt{2\pi}) \exp(-x^2/2)$ is the zero-mean, 
unit-variance Gaussian pdf. The corresponding conditional entropy 
for a given source pdf $f_X(x)$ is given by (2) as 

$$h(Y|X) = -\int_{-\infty}^{\infty} f_X(x) \int_{-\infty}^{\infty} f_{Y|X}(y|x) \log f_{Y|X}(y|x)\, dy\, dx.$$

Since $f_{Y|X}(y|x)$ is Gaussian for a given $x$, the inner integral 
is [1, Sec. 20], [4, Sec. 8.1] 

$$\int_{-\infty}^{\infty} f_{Y|X}(y|x) \log f_{Y|X}(y|x)\, dy = -\frac{1}{2} \log 2\pi e \sigma_Z^2$$

independently of $x$ and hence 

$$h(Y|X) = \frac{1}{2} \log 2\pi e \sigma_Z^2. \tag{16}$$

The mutual information for this channel and any source pdf 
$f_X(x)$ is $I(X;Y) = h(Y) - h(Y|X)$, where 

$$\begin{aligned}
h(Y) &= -\int_{-\infty}^{\infty} f_Y(y) \log f_Y(y)\, dy, \\
f_Y(y) &= \int_{-\infty}^{\infty} f_X(x) f_{Y|X}(y|x)\, dx,
\end{aligned} \tag{17}$$

$f_{Y|X}(y|x)$ is given by (15), and $h(Y|X)$ is given by (16). 

Fig. 1. A simple example of nonlinear distortion, given by (18) for $a_{\max} = 10$. 
The channel is essentially linear for small $|x|$ and binary for large $|x|$. 



As an example, we select $a(x)$ in (14) as a smooth clipping 
function 

$$a(x) = a_{\max} \tanh\!\left(\frac{x}{a_{\max}}\right), \tag{18}$$

where $a_{\max} > 0$ sets an upper bound on the output. This channel 
is chosen for its simplicity and because its characteristics 
serve to illustrate the Law of monotonic channel capacity 
(Theorem 1), not for its resemblance to any particular physical 
system. If the instantaneous channel input $X$ has a sufficiently 
high magnitude, the channel is essentially binary. For $X$ close 
to zero, on the other hand, the channel approaches a linear 
AWGN channel. 

The channel parameters are $a_{\max} = 10$ and $\sigma_Z = 1$ 
throughout this paper. The function $a(x)$ in (18), which 
represents the nonlinear part of the channel (14), is shown 
in Fig. 1. 
Fig. 2. Mutual information for the nonlinear channel in (14) with $a_{\max} = 10$ 
and $\sigma_Z = 1$, when the source pdf is Gaussian, uniform, and exponential. The 
AWGN channel capacity is included for reference. 




Fig. 3. Mutual information for the same channel, when the source follows 
various discrete distributions. The source probabilities are uniform. 



B. Mutual Information 

The mutual information $I(X;Y)$ is evaluated by numerical 
integration, as a function of the transmitted power $P$. No 
optimization over pdfs is carried out. The source pdf $f_X(x)$ 
is constructed from a given base pdf $g(x)$, rescaled to the 
desired power $P$ as $f_X(x) = \beta g(\beta x)$, where $\beta = \sqrt{P_g/P}$ 
and $P_g$ is given by (7). The results are presented in Fig. 2 
for three continuous source pdfs $f_X(x)$: zero-mean Gaussian, 
zero-mean uniform, and single-sided exponential, defined as, 
respectively, 

$$f_{X_1}(x) = \frac{1}{\sqrt{P}} f_G\!\left(\frac{x}{\sqrt{P}}\right),$$

$$f_{X_2}(x) = \begin{cases} \dfrac{1}{2\sqrt{3P}}, & -\sqrt{3P} < x < \sqrt{3P}, \\ 0, & \text{elsewhere}, \end{cases}$$

$$f_{X_3}(x) = \begin{cases} \sqrt{\dfrac{2}{P}}\, e^{-x\sqrt{2/P}}, & x \ge 0, \\ 0, & x < 0. \end{cases}$$



At asymptotically low power $P$, the channel is effectively an 
AWGN channel. In this case, the mutual information is governed 
by the mean value of the source distribution, according 
to [43, Th. 7]. All zero-mean sources achieve approximately 
the same mutual information, which approaches the AWGN 
channel capacity. The asymptotic mutual information for the 
exponential distribution, whose mean is $\sqrt{P/2}$, is half that 
achieved by zero-mean distributions. 

The mutual information curves for all three source pdfs 
reach a peak around $P = 100$, when a large portion of 
the source samples still fall in the linear regime of the 
channel. When the average power $P$ is further increased, the 
mutual information decreases asymptotically towards a value 
slightly less than 1 for the zero-mean sources and towards 0 for 
the exponential source. The asymptotes are explained by the 
fact that at high enough power, almost all source samples fall 
in the nonlinear regime, where the channel behaves as a 1-bit 
noisy quantizer. 
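
The numerical integration described above can be sketched in a few lines for the Gaussian source. This is an illustrative sketch, not the author's implementation. It assumes that the clipping function has the form $a(x) = a_{\max} \tanh(x/a_{\max})$, that the Gaussian source is approximated by a dense grid of weighted points, and that a plain trapezoidal rule is accurate enough. The names `gaussian_source_mi` and `trapezoid` are hypothetical.

```python
import numpy as np

A_MAX, SIGMA_Z = 10.0, 1.0   # channel parameters stated in the text

def a(x):
    """Smooth clipping nonlinearity; the tanh(x / a_max) form is an assumption."""
    return A_MAX * np.tanh(x / A_MAX)

def trapezoid(values, grid):
    """Plain trapezoidal rule, written out to avoid version-specific NumPy names."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

def gaussian_source_mi(power, n_x=2001, n_y=1501):
    """Approximate I(X;Y) in bits for a zero-mean Gaussian source of the given power."""
    s = np.sqrt(power)
    x = np.linspace(-8.0 * s, 8.0 * s, n_x)
    w = np.exp(-0.5 * (x / s) ** 2)
    w /= w.sum()                                     # discretized source weights
    y = np.linspace(-A_MAX - 8.0 * SIGMA_Z, A_MAX + 8.0 * SIGMA_Z, n_y)
    # f_Y(y) = sum_i w_i * N(y; a(x_i), sigma_Z^2): the output pdf in (17)
    z = (y[:, None] - a(x)[None, :]) / SIGMA_Z
    f_y = (np.exp(-0.5 * z**2) / (SIGMA_Z * np.sqrt(2.0 * np.pi))) @ w
    h_y = trapezoid(-f_y * np.log2(np.maximum(f_y, 1e-300)), y)
    h_y_given_x = 0.5 * np.log2(2.0 * np.pi * np.e * SIGMA_Z**2)   # (16)
    return h_y - h_y_given_x

for p in (0.01, 100.0, 1e4):
    print(p, gaussian_source_mi(p))
```

Under these assumptions, the value at very low power should be close to the AWGN expression $\frac{1}{2}\log_2(1+P)$, and the value at $P = 10^4$ should be lower than the value at $P = 100$, in line with the peak described above.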

Similar results for various discrete source distributions are 
shown in Fig. 3. The studied one-dimensional constellations 
are on-off keying (OOK), binary phase-shift keying (BPSK), 
and $M$-ary pulse amplitude modulation ($M$-PAM). The constellation 
points are equally spaced and the source samples $X$ 
are chosen uniformly from these constellations. The mutual 
information for $M$-PAM constellations with $M > 4$ exhibits 
the same kind of peak as the continuous distributions in Fig. 2; 
indeed, a uniform distribution over equally spaced $M$-PAM 
approaches the continuous uniform distribution as $M \to \infty$. 

Similarly to the continuous case, the zero-mean discrete 
sources approach the AWGN channel capacity as $P \to 0$. Half 
this channel capacity is achieved by the OOK source, which 
has the same mean value $\sqrt{P/2}$ as the exponential source 
above. The asymptotics when $P \to \infty$ depend on whether $M$ 
is even or odd. For any even $M$, the channel again acts like a 
1-bit quantizer and the asymptotic mutual information is slightly 
less than 1. For odd $M$, however, here exemplified by 3-PAM, 
there is a nonzero probability mass at $X = 0$, which means 
that the possible outputs are not only $Y = \pm a_{\max} + Z$ but 
also $Y = 0 + Z$. Hence, the channel asymptotically approaches 
a ternary-output noisy channel, whose mutual information is 
upperbounded by $\log 3 \approx 1.58$. 
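
The two high-power asymptotes can be reproduced with a small sketch for uniform, equally spaced constellations. It is not the author's code: it assumes $a(x) = a_{\max} \tanh(x/a_{\max})$, uses a very large but finite power as a stand-in for $P \to \infty$, and the helper names `discrete_source_mi` and `uniform_pam` are hypothetical.

```python
import numpy as np

A_MAX, SIGMA_Z = 10.0, 1.0   # channel parameters stated in the text

def discrete_source_mi(points, weights):
    """I(X;Y) in bits for a discrete source over Y = a(X) + Z (tanh form assumed)."""
    means = A_MAX * np.tanh(np.asarray(points, dtype=float) / A_MAX)
    w = np.asarray(weights, dtype=float)
    y = np.linspace(-A_MAX - 8.0 * SIGMA_Z, A_MAX + 8.0 * SIGMA_Z, 4001)
    z = (y[:, None] - means[None, :]) / SIGMA_Z
    f_y = (np.exp(-0.5 * z**2) / (SIGMA_Z * np.sqrt(2.0 * np.pi))) @ w
    integrand = -f_y * np.log2(np.maximum(f_y, 1e-300))
    h_y = float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(y)))
    return h_y - 0.5 * np.log2(2.0 * np.pi * np.e * SIGMA_Z**2)

def uniform_pam(m, power):
    """Equally spaced, zero-mean M-PAM with uniform probabilities and given power."""
    levels = np.arange(m) - (m - 1) / 2.0
    levels = levels * np.sqrt(power / np.mean(levels**2))
    return levels, np.full(m, 1.0 / m)

P_HIGH = 1e8   # stand-in for the limit P -> infinity
for m in (2, 3, 4):
    print(m, discrete_source_mi(*uniform_pam(m, P_HIGH)))
```

With even $M$ all points saturate to $\pm a_{\max}$ and the result is close to 1 bit, whereas 3-PAM keeps a point at zero and comes close to, but stays below, $\log_2 3$.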

To summarize, this particular channel has the property that 
the mutual information for any source distribution approaches 
a limit as $P \to \infty$, and this limit is upperbounded by 
$\log 3$. It might seem tempting to conclude that the channel 
capacity, which is the supremum of all mutual information 
curves, would behave similarly. However, as we shall see 
in Section IV-D, this conclusion is not correct, because the 
limit of a supremum is in general not equal to the supremum 
of a limit. Specifically, the asymptotical channel capacity is 
$\lim_{P \to \infty} C(P) = \lim_{P \to \infty} \sup_g I(X;Y)$, which is not equal 
to $\sup_g \lim_{P \to \infty} I(X;Y) \le \log 3$. 



C. Constrained Capacity 

The standard method to calculate the channel capacity 
of a discrete memoryless channel is by the Blahut-Arimoto 
algorithm [4, Sec. 10.8], [44, Ch. 9]. It has been extended 
to continuous-input, continuous-output channels in [45], [46]. 
Our approach is most similar to [46], in which pdfs are 
represented by lists of samples, so-called particles. We consider 
Fig. 4. Constrained capacities for the same channel, where the source is 
constrained to a given constellation but the source probabilities are optimally 
chosen for each $P$. The colored curves indicate the nonoptimized mutual 
informations from Fig. 3 (uniform probabilities). 

Fig. 5. Channel capacity for the same channel. All mutual information 
and constrained capacity curves from Figs. 2-4 are included for reference 
(colored). Even though most mutual information and constrained capacity 
curves decrease, the channel capacity does not. The three markers refer to 
distributions in Fig. 6. 



a source pdf of the form 

$$f_X(x) = \sum_{i=1}^{N} w_i \delta(x - c_i), \tag{19}$$

where $\delta(\cdot)$ is the Dirac delta function, $N$ is the number 
of samples, $c = (c_1, \ldots, c_N)$ are the samples, and $w = 
(w_1, \ldots, w_N)$ are the probabilities, or weights, associated 
with each sample. If $N$ is large enough, any pdf can be 
represented in the form (19) with arbitrarily small error. With 
this representation, 

$$f_Y(y) = \sum_{i=1}^{N} w_i f_{Y|X}(y|c_i),$$

which when substituted in (17) yields $h(Y)$, and thereby 
$I(X;Y)$, by numerical integration. 

The objective for the optimization is to maximize the 
Lagrangian function 

$$L(c, w, \lambda_1, \lambda_2) = h(Y) + \lambda_1 \left(\sum_i w_i - 1\right) + \lambda_2 \left(\sum_i w_i c_i^2 - P\right),$$

where the Lagrange multipliers $\lambda_1$ and $\lambda_2$ are determined to 
maintain the constraints $\sum_i w_i = 1$ and $\sum_i w_i c_i^2 = P$ during 
the optimization process. The gradients of $L$ with respect to 
$c$ and $w$ are calculated, and a steepest descent algorithm (or 
more accurately, "steepest ascent") is applied to maximize $L$. 
In each iteration, a step is taken in the direction of either of the 
two gradients.⁴ The step size is determined using the golden 
section method [47, pp. 271-273]. Constrained capacities were 
obtained by including additional constraints on $c$ and/or $w$. 

⁴ Moving in the direction of the joint gradient turned out to be less efficient, 
because for small and large $P$, the numerical values of $c$ and $w$ are not of 
the same order of magnitude. 



Several initial values (c, w) were tried, and N was increased 
until convergence. 
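
The step-size selection can be illustrated with a generic golden-section search. This is a textbook version of the method, given as a sketch only; how it was combined with the gradient steps in the actual optimization is not specified here. The function name, the search interval, and the tolerance are assumptions, and the objective is required to be unimodal on the interval.

```python
import math

def golden_section_max(func, lo, hi, tol=1e-8, max_iter=200):
    """Return t in [lo, hi] that approximately maximizes a unimodal func."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0       # 1 / golden ratio, about 0.618
    t1 = hi - inv_phi * (hi - lo)
    t2 = lo + inv_phi * (hi - lo)
    f1, f2 = func(t1), func(t2)
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        if f1 < f2:
            # The maximum lies in [t1, hi]; the old t2 becomes the new t1.
            lo, t1, f1 = t1, t2, f2
            t2 = lo + inv_phi * (hi - lo)
            f2 = func(t2)
        else:
            # The maximum lies in [lo, t2]; the old t1 becomes the new t2.
            hi, t2, f2 = t2, t1, f1
            t1 = hi - inv_phi * (hi - lo)
            f1 = func(t1)
    return 0.5 * (lo + hi)

# Toy objective standing in for L evaluated along one gradient direction.
best_step = golden_section_max(lambda t: -(t - 2.0) ** 2, 0.0, 5.0)
print(best_step)
```

Each iteration reuses one of the two previous function evaluations, so only one new evaluation of the objective is needed per step, which matters when every evaluation involves a numerical integration.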

The topography of $L$ as a function of $c$ and $w$ turned out to 
include vast flat fields, where a small step has little influence 
on $L$. This made the optimization numerically challenging. No 
suboptimal local maxima were found for the studied channel 
and constraints, although for nonlinear channels in general, 
the mutual information as a function of the source distribution 
may have multiple maxima.⁵ 

Using this optimization technique, some constrained capacities 
are computed. Specifically, we investigate how much 
the mutual information curves in Fig. 3 can be improved 
if the source samples $X$ are chosen from the constellation 
points $c$ with unequal probabilities $w$, so-called probabilistic 
shaping. The constellations are the same as before, equally 
spaced OOK, BPSK, and $M$-PAM, but the probability of 
each constellation point is allowed to vary. For each power 
$P$, the mutual information is maximized over all probabilities. 
The constellation is scaled to meet the power requirement but 
otherwise not changed. 

The results are shown in Fig. 4 for the same channel 
as before ((14) with (18) and parameters $a_{\max} = 10$ and 
$\sigma_Z = 1$). The BPSK performance offers no improvement over 
the mutual information of uniform BPSK in Fig. 3, because 
equal probabilities turn out to be optimal for all $P$. However, 
the constrained capacity of OOK with optimal probabilistic 
shaping is about twice the mutual information of uniform OOK 
at low $P$. The improvements for 3- and 4-PAM are marginal, 
whereas the performance of 8-PAM is significantly improved 
for medium to high power, and its peak increases from 2.28 to 
2.37 bits/symbol. The general trends, however, are the same as 
for the mutual information in Fig. 3. The constrained capacity 
for any probabilistically shaped $M$-PAM system with $M > 4$ 

⁵ An exception occurs when the constellation points $c$ are fixed and the 
only constraint is $\sum_i w_i = 1$. In this special case, the mutual information is 
a concave function of $w$ for any channel [4, pp. 33, 191] and there is thus a 
unique maximum. 

Fig. 6. Capacity-achieving source distributions for P = 10, 100, and 1000. 



displays a prominent peak around P = 100, after which 
the constrained capacity decreases again towards the same 
asymptotes as in the uniform case. 

Obviously, there exist many other types of source constraints. 
Some of these have constrained capacities similar 
to those of the probabilistically shaped discrete constellations 
shown in Fig. 4, with a peak at a finite power and a relatively 
weak asymptotic performance, but other classes of sources can 
be conceived that are better suited to this nonlinear channel at 
high transmitted power. However, instead of designing further 
constrained sources, we will now proceed to study the channel 
capacity, which is the main concern of this paper. 

D. Channel Capacity 

By optimizing the mutual information over unconstrained 
source distributions $\Omega(P)$, according to the method outlined 
in Sec. IV-C, we obtain the channel capacity (6). As mentioned 
in Sec. II, the channel capacity is a property of the channel 
alone, not the source, so there exists just one channel capacity 
curve for a given channel. 

This channel capacity is shown in Fig. 5 for the studied 
channel. As promised by the Law of monotonic channel 
capacity (Theorem 1), the curve has no peak at any finite $P$, 
in contrast to most mutual information and 
constrained capacity curves. The channel capacity follows 
the mutual information of the Gaussian distribution closely 
until around $P = 100$. However, while the Gaussian case 
attains its maximum mutual information $I(X;Y) = 2.44$ 
bits/symbol at $P = 130$ and then begins to decrease, the 
channel capacity continues to increase towards its asymptote 
$\lim_{P \to \infty} C(P) = 2.54$ bits/symbol. 

This asymptotical channel capacity can be explained as 
follows. Define the random variable $A = a(X)$. Since $a(\cdot)$ is a 
continuous, strictly increasing function, there is a one-to-one 
mapping between $X \in (-\infty, \infty)$ and $A \in (-a_{\max}, a_{\max})$. 
Thus $I(X;Y) = I(A;Y)$, where $Y = A + Z$. This represents 
a standard discrete-time AWGN channel whose input $A$ is 
subject to a peak power constraint. The constrained capacity 
of a peak-power-limited AWGN channel was bounded already 
in [1, Sec. 25] and computed numerically in [48], where 
it was also shown that the capacity-achieving distribution is 
discrete. The asymptote in Fig. 5, which is 2.54 bits/symbol 
or, equivalently, 1.76 nats/symbol, agrees perfectly with the 
constrained capacity in [48, Fig. 2] for $a_{\max}/\sigma_Z = 10$. 
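
The unit conversion quoted above, together with a loose consistency check, can be verified in a few lines. This is only a sanity-check sketch with hypothetical names. The upper bound it uses is the standard AWGN capacity $\frac{1}{2}\log_2(1 + a_{\max}^2/\sigma_Z^2)$, which applies because an input confined to $|A| \le a_{\max}$ has average power at most $a_{\max}^2$.

```python
import math

def bits_to_nats(bits):
    """Convert an information quantity from bits to nats."""
    return bits * math.log(2.0)

A_MAX, SIGMA_Z = 10.0, 1.0
asymptote_bits = 2.54                       # value read from the text
asymptote_nats = bits_to_nats(asymptote_bits)

# A peak constraint |A| <= a_max implies average power <= a_max^2, so the
# average-power AWGN capacity is a (loose) upper bound on the asymptote.
awgn_bound_bits = 0.5 * math.log2(1.0 + (A_MAX / SIGMA_Z) ** 2)

print(round(asymptote_nats, 2), round(awgn_bound_bits, 2))
```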

Some of the (almost) capacity-achieving source distributions 
are shown in Fig. 6, numerically optimized as described in 
Sec. IV-C. As mentioned, the topography of $L$ as a function of 
the source parameters for a given $P$ includes vast, almost flat, 
fields, where many source distributions yield the same mutual 
information, within a numerical precision of 2-3 decimals. 
For $P = 10$, the optimized discrete source is essentially a 
nonuniformly sampled Gaussian pdf, and the obtained channel 
capacity, 1.61, has the same value as the mutual information of 
a continuous Gaussian pdf, shown in Fig. 2. For $P = 100$ and 
1000, the distribution is more uniform in the range where the 
channel behaves more or less linearly, which for this channel 
is approximately at $-a_{\max}/2 < x < a_{\max}/2$, with some 
high-power outliers in the nonlinear range $|x| > a_{\max}$. In 
all cases, increasing the number of particles $N$ from what 
is shown in Fig. 6 does not increase the mutual information 
significantly, from which we infer that these discrete sources 
perform practically as well as the best discrete or continuous 
sources for this channel. 

Although the capacity-achieving distributions would look 
quite different for other types of nonlinear channels, a general 
observation can be made from Fig. 6: Even at high average 
power, the source should generate samples with moderate 
power, for which the channel is good, most of the time. The 
high average power is achieved by some samples having a very 
large power but small probability. The two-part distribution 
(8), which was used to prove Theorem 1, can be seen as a 
theoretical counterpart of this practical design principle. 

V. Conclusions 

For many nonlinear channels, common performance mea- 
sures such as bit error rate, mutual information, and con- 
strained capacity begin to degrade when the transmitted power 
is increased high enough. The contribution of this paper is 
to prove that the channel capacity, as defined by Shannon, 
behaves entirely differently for all channels: Increasing the 
average power can never degrade the channel capacity, since 
the adverse effects of high power can always be compensated 
for by suitably adjusting the source distribution. Until now, the 
prevalent paradigm in optical information theory has suggested 
the opposite. 
This general result holds regardless of whether the mutual 
information is maximized over all source distributions with 
a given average power or all distributions with at most the 
given average power. Indeed, we have shown that the two 
optimization rules always yield the same channel capacity. 